Goto

Collaborating Authors

 ego vehicle


Thinking Ahead: Foresight Intelligence in MLLMs and World Models

Gong, Zhantao, Fan, Liaoyuan, Guo, Qing, Xu, Xun, Yang, Xulei, Li, Shijie

arXiv.org Artificial Intelligence

In this work, we define Foresight Intelligence as the capability to anticipate and interpret future events-an ability essential for applications such as autonomous driving, yet largely overlooked by existing research. To bridge this gap, we introduce FSU-QA, a new Visual Question-Answering (VQA) dataset specifically designed to elicit and evaluate Foresight Intelligence. Using FSU-QA, we conduct the first comprehensive study of state-of-the-art Vision-Language Models (VLMs) under foresight-oriented tasks, revealing that current models still struggle to reason about future situations. Beyond serving as a benchmark, FSU-QA also enables the assessment of world models by measuring the semantic coherence of their generated predictions, quantified through performance gains when VLMs are augmented with such outputs. Our experiments further demonstrate that FSU-QA can effectively enhance foresight reasoning: even small VLMs fine-tuned on FSU-QA surpass much larger, advanced models by a substantial margin. Together, these findings position FSU-QA as a principled foundation for developing next-generation models capable of truly anticipating and understanding future events.


COVLM-RL: Critical Object-Oriented Reasoning for Autonomous Driving Using VLM-Guided Reinforcement Learning

Li, Lin, Cai, Yuxin, Fang, Jianwu, Xue, Jianru, Lv, Chen

arXiv.org Artificial Intelligence

End-to-end autonomous driving frameworks face persistent challenges in generalization, training efficiency, and interpretability. While recent methods leverage Vision-Language Models (VLMs) through supervised learning on large-scale datasets to improve reasoning, they often lack robustness in novel scenarios. Conversely, reinforcement learning (RL)-based approaches enhance adaptability but remain data-inefficient and lack transparent decision-making. % contribution To address these limitations, we propose COVLM-RL, a novel end-to-end driving framework that integrates Critical Object-oriented (CO) reasoning with VLM-guided RL. Specifically, we design a Chain-of-Thought (CoT) prompting strategy that enables the VLM to reason over critical traffic elements and generate high-level semantic decisions, effectively transforming multi-view visual inputs into structured semantic decision priors. These priors reduce the input dimensionality and inject task-relevant knowledge into the RL loop, accelerating training and improving policy interpretability. However, bridging high-level semantic guidance with continuous low-level control remains non-trivial. To this end, we introduce a consistency loss that encourages alignment between the VLM's semantic plans and the RL agent's control outputs, enhancing interpretability and training stability. Experiments conducted in the CARLA simulator demonstrate that COVLM-RL significantly improves the success rate by 30\% in trained driving environments and by 50\% in previously unseen environments, highlighting its strong generalization capability.


From Zero to High-Speed Racing: An Autonomous Racing Stack

Jardali, Hassan, Pushp, Durgakant, Yu, Youwei, Ali, Mahmoud, Mohamed, Ihab S., Murillo-Gonzalez, Alejandro, Coen, Paul D., Khan, Md. Al-Masrur, Pulivendula, Reddy Charan, Park, Saeoul, Zhou, Lingchuan, Liu, Lantao

arXiv.org Artificial Intelligence

High-speed, head-to-head autonomous racing presents substantial technical and logistical challenges, including precise localization, rapid perception, dynamic planning, and real-time control-compounded by limited track access and costly hardware. This paper introduces the Autonomous Race Stack (ARS), developed by the IU Luddy Autonomous Racing team for the Indy Autonomous Challenge (IAC). We present three iterations of our ARS, each validated on different tracks and achieving speeds up to 260 km/h. Our contributions include: (i) the modular architecture and evolution of the ARS across ARS1, ARS2, and ARS3; (ii) a detailed performance evaluation that contrasts control, perception, and estimation across oval and road-course environments; and (iii) the release of a high-speed, multi-sensor dataset collected from oval and road-course tracks. Our findings highlight the unique challenges and insights from real-world high-speed full-scale autonomous racing.


Prediction-Driven Motion Planning: Route Integration Strategies in Attention-Based Prediction Models

Steiner, Marlon, Wagner, Royden, Tas, Ömer Sahin, Stiller, Christoph

arXiv.org Artificial Intelligence

Personal use of this material is permitted. Abstract-- Combining motion prediction and motion planning offers a promising framework for enhancing interactions between automated vehicles and other traffic participants. However, this introduces challenges in conditioning predictions on navigation goals and ensuring stable, kinematically feasible trajectories. Addressing the former challenge, this paper investigates the extension of attention-based motion prediction models with navigation information. By integrating the ego vehicle's intended route and goal pose into the model architecture, we bridge the gap between multi-agent motion prediction and goal-based motion planning. We propose and evaluate several architectural navigation integration strategies to our model on the nuPlan dataset. Our results demonstrate the potential of prediction-driven motion planning, highlighting how navigation information can enhance both prediction and planning tasks. In driving scenarios that involve interactions between traffic participants, accurately predicting others' future trajectories is essential for effective planning.


Driving is a Game: Combining Planning and Prediction with Bayesian Iterative Best Response

Distelzweig, Aron, Wang, Yiwei, Janjoš, Faris, Hallgarten, Marcel, Dobre, Mihai, Langmann, Alexander, Boedecker, Joschka, Betz, Johannes

arXiv.org Artificial Intelligence

Autonomous driving planning systems perform nearly perfectly in routine scenarios using lightweight, rule-based methods but still struggle in dense urban traffic, where lane changes and merges require anticipating and influencing other agents. Modern motion predictors offer highly accurate forecasts, yet their integration into planning is mostly rudimental: discarding unsafe plans. Similarly, end-to-end models offer a one-way integration that avoids the challenges of joint prediction and planning modeling under uncertainty. In contrast, game-theoretic formulations offer a principled alternative but have seen limited adoption in autonomous driving. We present Bayesian Iterative Best Response (BIBeR), a framework that unifies motion prediction and game-theoretic planning into a single interaction-aware process. BIBeR is the first to integrate a state-of-the-art predictor into an Iterative Best Response (IBR) loop, repeatedly refining the strategies of the ego vehicle and surrounding agents. This repeated best-response process approximates a Nash equilibrium, enabling bidirectional adaptation where the ego both reacts to and shapes the behavior of others. In addition, our proposed Bayesian confidence estimation quantifies prediction reliability and modulates update strength, more conservative under low confidence and more decisive under high confidence. BIBeR is compatible with modern predictors and planners, combining the transparency of structured planning with the flexibility of learned models. Experiments show that BIBeR achieves an 11% improvement over state-of-the-art planners on highly interactive interPlan lane-change scenarios, while also outperforming existing approaches on standard nuPlan benchmarks.


CogDrive: Cognition-Driven Multimodal Prediction-Planning Fusion for Safe Autonomy

Huang, Heye, Yang, Yibin, Fan, Mingfeng, Wang, Haoran, Zhao, Xiaocong, Wang, Jianqiang

arXiv.org Artificial Intelligence

Safe autonomous driving in mixed traffic requires a unified understanding of multimodal interactions and dynamic planning under uncertainty. Existing learning based approaches struggle to capture rare but safety critical behaviors, while rule based systems often lack adaptability in complex interactions. To address these limitations, CogDrive introduces a cognition driven multimodal prediction and planning framework that integrates explicit modal reasoning with safety aware trajectory optimization. The prediction module adopts cognitive representations of interaction modes based on topological motion semantics and nearest neighbor relational encoding. With a differentiable modal loss and multimodal Gaussian decoding, CogDrive learns sparse and unbalanced interaction behaviors and improves long horizon trajectory prediction. The planning module incorporates an emergency response concept and optimizes safety stabilized trajectories, where short term consistent branches ensure safety during replanning cycles and long term branches support smooth and collision free motion under low probability switching modes. Experiments on Argoverse2 and INTERACTION datasets show that CogDrive achieves strong performance in trajectory accuracy and miss rate, while closed loop simulations confirm adaptive behavior in merge and intersection scenarios. By combining cognitive multimodal prediction with safety oriented planning, CogDrive offers an interpretable and reliable paradigm for safe autonomy in complex traffic.


Vehicle Dynamics Embedded World Models for Autonomous Driving

Li, Huiqian, Pan, Wei, Zhang, Haodong, Huang, Jin, Zhong, Zhihua

arXiv.org Artificial Intelligence

World models have gained significant attention as a promising approach for autonomous driving. By emulating human-like perception and decision-making processes, these models can predict and adapt to dynamic environments. Existing methods typically map high-dimensional observations into compact latent spaces and learn optimal policies within these latent representations. However, prior work usually jointly learns ego-vehicle dynamics and environmental transition dynamics from the image input, leading to inefficiencies and a lack of robustness to variations in vehicle dynamics. To address these issues, we propose the Vehicle Dynamics embedded Dreamer (VDD) method, which decouples the modeling of ego-vehicle dynamics from environmental transition dynamics. This separation allows the world model to generalize effectively across vehicles with diverse parameters. Additionally, we introduce two strategies to further enhance the robustness of the learned policy: Policy Adjustment during Deployment (PAD) and Policy Augmentation during Training (PAT). Comprehensive experiments in simulated environments demonstrate that the proposed model significantly improves both driving performance and robustness to variations in vehicle dynamics, outperforming existing approaches.


New Spiking Architecture for Multi-Modal Decision-Making in Autonomous Vehicles

Ghoreishee, Aref, Mishra, Abhishek, Zhou, Lifeng, Walsh, John, Kandasamy, Nagarajan

arXiv.org Artificial Intelligence

This work proposes an end-to-end multi-modal reinforcement learning framework for high-level decision-making in autonomous vehicles. The framework integrates heterogeneous sensory input, including camera images, LiDAR point clouds, and vehicle heading information, through a cross-attention transformer-based perception module. Although transformers have become the backbone of modern multi-modal architectures, their high computational cost limits their deployment in resource-constrained edge environments. To overcome this challenge, we propose a spiking temporal-aware transformer-like architecture that uses ternary spiking neurons for computationally efficient multi-modal fusion. Comprehensive evaluations across multiple tasks in the Highway Environment demonstrate the effectiveness and efficiency of the proposed approach for real-time autonomous decision-making.


MTR-VP: Towards End-to-End Trajectory Planning through Context-Driven Image Encoding and Multiple Trajectory Prediction

Keskar, Maitrayee, Trivedi, Mohan, Greer, Ross

arXiv.org Artificial Intelligence

We present a method for trajectory planning for autonomous driving, learning image-based context embeddings that align with motion prediction frameworks and planning-based intention input. Within our method, a ViT encoder takes raw images and past kinematic state as input and is trained to produce context embeddings, drawing inspiration from those generated by the recent MTR (Motion Transformer) encoder, effectively substituting map-based features with learned visual representations. MTR provides a strong foundation for multimodal trajectory prediction by localizing agent intent and refining motion iteratively via motion query pairs; we name our approach MTR-VP (Motion Transformer for Vision-based Planning), and instead of the learnable intention queries used in the MTR decoder, we use cross attention on the intent and the context embeddings, which reflect a combination of information encoded from the driving scene and past vehicle states. We evaluate our methods on the Waymo End-to-End Driving Dataset, which requires predicting the agent's future 5-second trajectory in bird's-eye-view coordinates using prior camera images, agent pose history, and routing goals. We analyze our architecture using ablation studies, removing input images and multiple trajectory output. Our results suggest that transformer-based methods that are used to combine the visual features along with the kinetic features such as the past trajectory features are not effective at combining both modes to produce useful scene context embeddings, even when intention embeddings are augmented with foundation-model representations of scene context from CLIP and DINOv2, but that predicting a distribution over multiple futures instead of a single future trajectory boosts planning performance.


Incorporating Ephemeral Traffic Waves in A Data-Driven Framework for Microsimulation in CARLA

Richardson, Alex, Hasan, Azhar, Karsai, Gabor, Sprinkle, Jonathan

arXiv.org Artificial Intelligence

This paper introduces a data-driven traffic microsimulation framework in CARLA that reconstructs real-world wave dynamics using high-fidelity time-space data from the I-24 MOTION testbed. Calibration of road networks in microsimulators to reproduce ephemeral phenomena such as traffic waves for large-scale simulation is a process that is fraught with challenges. This work reconsiders the existence of the traffic state data as boundary conditions on an ego vehicle moving through previously recorded traffic data, rather than reproducing those traffic phenomena in a calibrated microsim. Our approach is to autogenerate a 1 mile highway segment corresponding to I-24, and use the I-24 data to power a cosimulation module that injects traffic information into the simulation. The CARLA and cosimulation simulations are centered around an ego vehicle sampled from the empirical data, with autogeneration of "visible" traffic within the longitudinal range of the ego vehicle. Boundary control beyond these visible ranges is achieved using ghost cells behind (upstream) and ahead (downstream) of the ego vehicle. Unlike prior simulation work that focuses on local car-following behavior or abstract geometries, our framework targets full time-space diagram fidelity as the validation objective. Leveraging CARLA's rich sensor suite and configurable vehicle dynamics, we simulate wave formation and dissipation in both low-congestion and high-congestion scenarios for qualitative analysis. The resulting emergent behavior closely mirrors that of real traffic, providing a novel cosimulation framework for evaluating traffic control strategies, perception-driven autonomy, and future deployment of wave mitigation solutions. Our work bridges microscopic modeling with physical experimental data, enabling the first perceptually realistic, boundary-driven simulation of empirical traffic wave phenomena in CARLA.